Efficient Modeling of Future Context for Image Captioning
Existing approaches to image captioning usually generate the sentence word by word from left to right, conditioned only on local context, i.e., the given image and the previously generated words. Many studies have aimed to make use of global information during decoding, e.g., through iterative refinement. However, how to effectively and efficiently incorporate future context remains under-explored. To address this issue, and inspired by the fact that Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations through a modified mask operation, we aim to graft this advantage onto the conventional Autoregressive Image Captioning (AIC) model while maintaining inference efficiency without extra time cost. Specifically, the AIC and NAIC models are first trained jointly with a shared visual encoder, forcing the encoder to contain sufficient and valid future context; the AIC model is then encouraged to capture the causal dynamics of cross-layer interchanging from the NAIC model on its unconfident words, following a teacher-student paradigm optimized with a distribution calibration training objective. Empirical evidence demonstrates that our proposed approach clearly surpasses state-of-the-art baselines in both automatic metrics and human evaluations on the MS COCO benchmark. The source code is available at: https://github.com/feizc/Future-Caption
Comment: ACM Multimedia 202
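The abstract does not spell out the distribution calibration objective. Below is a minimal NumPy sketch of one plausible form, assuming KL distillation from the NAIC teacher to the AIC student applied only at positions where the student itself is unconfident; all names (`distribution_calibration_loss`, `threshold`) are hypothetical, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distribution_calibration_loss(student_logits, teacher_logits, threshold=0.5):
    """Hypothetical sketch: KL(teacher || student) averaged over the
    positions where the student's own max probability is below `threshold`.
    Shapes: (batch, seq_len, vocab)."""
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    # a word is "unconfident" if the student's peak probability is low
    unconfident = s.max(axis=-1) < threshold            # (batch, seq_len)
    kl = (t * (np.log(t + 1e-9) - np.log(s + 1e-9))).sum(axis=-1)
    denom = max(unconfident.sum(), 1)                   # avoid divide-by-zero
    return (kl * unconfident).sum() / denom
```

In practice this term would be added to the usual cross-entropy loss, with teacher logits detached from the gradient.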
Divide and Adapt: Active Domain Adaptation via Customized Learning
Active domain adaptation (ADA) aims to improve the model adaptation
performance by incorporating active learning (AL) techniques to label a
maximally-informative subset of target samples. Conventional AL methods do not
consider the existence of domain shift, and hence, fail to identify the truly
valuable samples in the context of domain adaptation. To accommodate active
learning and domain adaptation, two naturally different tasks, in a
collaborative framework, we advocate that a customized learning strategy for
the target data is the key to the success of ADA solutions. We present
Divide-and-Adapt (DiaNA), a new ADA framework that partitions the target
instances into four categories with stratified transferable properties. With a
novel data subdivision protocol based on uncertainty and domainness, DiaNA can
accurately recognize the most gainful samples. While sending the informative
instances for annotation, DiaNA employs tailored learning strategies for the
remaining categories. Furthermore, we propose an informativeness score that
unifies the data partitioning criteria. This enables the use of a Gaussian
mixture model (GMM) to automatically assign unlabeled data to the proposed
four categories. Thanks to the "divide-and-adapt" spirit, DiaNA can handle data
with large variations of domain gap. In addition, we show that DiaNA can
generalize to different domain adaptation settings, such as unsupervised domain
adaptation (UDA), semi-supervised domain adaptation (SSDA), source-free domain
adaptation (SFDA), etc.
Comment: CVPR2023, Highlight paper
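DiaNA's GMM-based subdivision can be illustrated with a toy one-dimensional expectation-maximization fit: a four-component Gaussian mixture over a scalar informativeness score assigns each unlabeled target sample to one of four categories. This is an illustrative sketch under the assumption of a precomputed scalar score, not the paper's implementation.

```python
import numpy as np

def fit_gmm_1d(scores, k=4, iters=100):
    """Tiny EM for a 1D Gaussian mixture; returns component means,
    variances, weights, and a hard category label per sample."""
    mu = np.quantile(scores, np.linspace(0.1, 0.9, k))  # spread-out init
    var = np.full(k, scores.var() / k + 1e-6)
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities of each component for each sample
        d = scores[:, None] - mu[None, :]
        logp = -0.5 * d**2 / var - 0.5 * np.log(2 * np.pi * var) + np.log(pi)
        logp -= logp.max(axis=1, keepdims=True)          # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from responsibilities
        nk = r.sum(axis=0) + 1e-9
        mu = (r * scores[:, None]).sum(axis=0) / nk
        d = scores[:, None] - mu[None, :]
        var = (r * d**2).sum(axis=0) / nk + 1e-6
        pi = nk / nk.sum()
    return mu, var, pi, r.argmax(axis=1)
```

Each hard label then selects one of the tailored learning strategies (annotation for the most gainful category, other treatments for the rest).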
Progressive Denoising Model for Fine-Grained Text-to-Image Generation
Recently, vector quantized autoregressive (VQ-AR) models have shown
remarkable results in text-to-image synthesis by predicting discrete image
tokens uniformly, from the top left to the bottom right of the latent space.
Although this simple generative process works surprisingly well, is it the
best way to generate an image? For instance, humans tend to create an image
from outline to fine detail, while VQ-AR models themselves do not consider the
relative importance of each component. In this paper, we present a progressive
denoising model for high-fidelity text-to-image generation. The proposed
method takes effect by creating new image tokens from coarse to fine based on
the existing context in a parallel manner and this procedure is recursively
applied until an image sequence is completed. The resulting coarse-to-fine
hierarchy makes the image generation process intuitive and interpretable.
Extensive experiments demonstrate that the progressive model produces
significantly better results when compared with the previous VQ-AR method in
FID score across a wide variety of categories and aspects. Moreover, the
text-to-image generation time of traditional AR increases linearly with the
output image resolution and hence is quite time-consuming even for normal-size
images. In contrast, our approach achieves a better trade-off between
generation quality and speed.
Comment: Technical report. arXiv admin note: text overlap with arXiv:2206.10789 by other authors
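The parallel coarse-to-fine procedure resembles confidence-based iterative parallel decoding. The sketch below illustrates that generic loop (each round, predict all still-masked tokens in parallel, keep the most confident, re-mask the rest), not the paper's exact algorithm; `predict_fn` stands in for the trained model.

```python
import numpy as np

MASK = -1  # placeholder id for a not-yet-generated token

def progressive_decode(predict_fn, seq_len, rounds=4):
    """Generic coarse-to-fine decoding sketch. `predict_fn(tokens)` is
    assumed to return (per-position confidence, per-position token id)."""
    tokens = np.full(seq_len, MASK, dtype=int)
    for r in range(1, rounds + 1):
        probs, preds = predict_fn(tokens)
        masked = tokens == MASK
        # keep enough tokens so the sequence finishes by the final round
        n_keep = int(np.ceil(seq_len * r / rounds)) - (seq_len - masked.sum())
        if n_keep <= 0:
            continue
        cand = np.where(masked)[0]
        order = cand[np.argsort(-probs[cand])]   # most confident first
        tokens[order[:n_keep]] = preds[order[:n_keep]]
    return tokens
```

Because every round fills many positions at once, the number of model calls is a fixed constant (`rounds`) rather than growing linearly with image resolution, which is the source of the claimed speed advantage over token-by-token AR decoding.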
Towards Efficient Sparse Coding for Scalable Image Annotation
DOI: 10.1145/2502081.2502127. MM 2013 - Proceedings of the 2013 ACM Multimedia Conference, pp. 947-95
Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding
Panoptic narrative grounding (PNG) aims to segment things and stuff objects
in an image described by noun phrases of a narrative caption. As a multimodal
task, an essential aspect of PNG is the visual-linguistic interaction between
image and caption. The previous two-stage method aggregates visual contexts
from offline-generated mask proposals to phrase features, which tend to be
noisy and fragmentary. The recent one-stage method aggregates only pixel
contexts from image features to phrase features, which may incur semantic
misalignment due to lacking object priors. To realize more comprehensive
visual-linguistic interaction, we propose to enrich phrases with coupled pixel
and object contexts by designing a Phrase-Pixel-Object Transformer Decoder
(PPO-TD), where both fine-grained part details and coarse-grained entity clues
are aggregated to phrase features. In addition, we propose a Phrase-Object
Contrastive Loss (POCL) to pull matched phrase-object pairs closer and push
away unmatched ones, aggregating more precise object contexts from more
phrase-relevant object tokens. Extensive experiments on the PNG benchmark show
that our method achieves new state-of-the-art performance by large margins.
Comment: Accepted by IJCAI 202
RNF169 limits 53BP1 deposition at DSBs to stimulate single-strand annealing repair
Unrestrained 53BP1 activity at DNA double-strand breaks (DSBs) hampers DNA end resection and upsets DSB repair pathway choice. RNF169 acts as a molecular rheostat to limit 53BP1 deposition at DSBs, but how this fine balance translates to DSB repair control remains undefined. In striking contrast to 53BP1, ChIP analyses of AsiSI-induced DSBs unveiled that RNF169 exhibits robust accumulation at DNA end-proximal regions and preferentially targets resected, RPA-bound DSBs. Accordingly, we found that RNF169 promotes CtIP-dependent DSB resection and favors homology-mediated DSB repair, and further showed that RNF169 dose-dependently stimulates single-strand annealing repair, in part, by alleviating the 53BP1-imposed barrier to DSB end resection. Our results highlight the interplay of RNF169 with 53BP1 in fine-tuning the choice of DSB repair pathways.
Circuit-wide Transcriptional Profiling Reveals Brain Region-Specific Gene Networks Regulating Depression Susceptibility
Depression is a complex, heterogeneous disorder and a leading contributor to the global burden of disease. Most previous research has focused on individual brain regions and genes contributing to depression. However, emerging evidence in humans and animal models suggests that dysregulated circuit function and gene expression across multiple brain regions drive depressive phenotypes. Here we performed RNA-sequencing on 4 brain regions from control animals and those susceptible or resilient to chronic social defeat stress at multiple time points. We employed an integrative network biology approach to identify transcriptional networks and key driver genes that regulate susceptibility to depressive-like symptoms. Further, we validated in vivo several key drivers and their associated transcriptional networks that regulate depression susceptibility and confirmed their functional significance at the levels of gene transcription, synaptic regulation and behavior. Our study reveals novel transcriptional networks that control stress susceptibility and offers fundamentally new leads for antidepressant drug discovery.
Towards Multi-view and Partially-Occluded Face Alignment
We present a robust model to locate facial landmarks under different views and possibly severe occlusions. To build reliable relationships between face appearance and shape under large view variations, we propose to formulate face alignment as an ℓ1-induced Stagewise Relational Dictionary (SRD) learning problem. During each training stage, the SRD model learns a relational dictionary to capture consistent relationships between face appearance and shape, which are respectively modeled by pose-indexed image features and the shape displacements of the currently estimated landmarks. During testing, the SRD model automatically selects a sparse set of the most related shape displacements for the test face and uses them to refine its shape iteratively. To locate facial landmarks under occlusions, we further propose to learn an occlusion dictionary to model different kinds of partial face occlusions. By deploying the occlusion dictionary into the SRD model, the alignment performance for occluded faces can be further improved. Our algorithm is simple, effective, and easy to implement. Extensive experiments on two benchmark datasets and two newly built datasets demonstrate its superior performance over state-of-the-art methods, especially for faces with large view variations and/or occlusions.
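The ℓ1-induced sparse selection step can be illustrated with a generic ISTA (iterative soft-thresholding) solver: given a dictionary D and a feature vector x, it finds a sparse coefficient vector so that only a few atoms (here, candidate shape displacements) receive nonzero weight. This is a standard sketch of ℓ1 sparse coding under assumed parameter names, not the authors' SRD training procedure.

```python
import numpy as np

def sparse_code_ista(D, x, lam=0.1, iters=200):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 by ISTA.
    The l1 penalty drives most coefficients exactly to zero,
    selecting a sparse set of dictionary atoms."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ a - x)             # gradient of the quadratic term
        z = a - grad / L                     # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return a
```

At test time, the indices of the surviving nonzero coefficients play the role of the "sparse set of the most related shape displacements" used to refine the current shape estimate.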